Conversation

Member

@pedro-psb pedro-psb commented Feb 4, 2026

Added a proper error class for duplicate content handling, plus some more logging that reports exactly which content items conflict.

When duplicates are detected, we do some extra work to collect duplicate content.
A simple performance test shows it's not too bad:

In [1]: import pulpcore.plugin.repo_version_utils as ut

In [2]: content_qs = Package.objects.all()

In [3]: unique_keys = Package.repo_key_fields

In [4]: content_qs.count()
Out[4]: 24547

In [5]: ut.count_duplicates(content_qs, unique_keys)
Out[5]: 7701

In [6]: %timeit ut.count_duplicates(content_qs, unique_keys)
35.1 ms ± 154 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit ut.collect_duplicates(content_qs, unique_keys)
61.4 ms ± 266 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
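The semantics of the two helpers timed above can be sketched ORM-free. The following is a minimal analogue operating on plain dicts rather than a Django queryset, so the row shape and return type are illustrative assumptions, not the actual implementation:

```python
from collections import defaultdict


def count_duplicates(rows, unique_keys):
    # Analogue of content_qs.count() - content_qs.distinct(*unique_keys).count():
    # total rows minus the number of distinct key tuples.
    seen = {tuple(r[k] for k in unique_keys) for r in rows}
    return len(rows) - len(seen)


def collect_duplicates(rows, unique_keys):
    # Group primary keys by their repo-key tuple and keep only the groups
    # with more than one member (the conflicting content).
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[k] for k in unique_keys)].append(r["pk"])
    return {key: pks for key, pks in groups.items() if len(pks) > 1}


rows = [
    {"pk": 1, "name": "foo", "version": "1.0"},
    {"pk": 2, "name": "foo", "version": "1.0"},  # duplicate of pk=1
    {"pk": 3, "name": "bar", "version": "2.0"},
]
print(count_duplicates(rows, ("name", "version")))    # 1
print(collect_duplicates(rows, ("name", "version")))  # {('foo', '1.0'): [1, 2]}
```

The count variant only needs two aggregate queries, while the collect variant materializes the conflicting groups, which matches the roughly 2x timing difference shown above.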

Closes: #7184

📜 Checklist

  • Commits are cleanly separated with meaningful messages (simple features and bug fixes should be squashed to one commit)
  • A changelog entry or entries has been added for any significant changes
  • Follows the Pulp policy on AI Usage
  • (For new features) - User documentation and test coverage have been added

See: Pull Request Walkthrough

@pedro-psb pedro-psb force-pushed the fix/7184-report-conflicting-packages branch from 26e2e75 to 7e9e749 Compare February 4, 2026 14:42
@pedro-psb pedro-psb force-pushed the fix/7184-report-conflicting-packages branch from 7e9e749 to a42e341 Compare February 4, 2026 14:45
@pedro-psb pedro-psb changed the title Add better error handling for repover duplicate content [PULP-1118] Add better error handling for repover duplicate content Feb 4, 2026
"""

def __init__(self, duplicate_count: int, correlation_id: str):
    self.dup_count = duplicate_count
Contributor

It would seem like this assumes that a RepositoryVersion creation always failed due to duplicates? Do we want to assume that? Should the error be more specific?

Member Author

Yeah, makes sense. I'll make a more specific one.

def log_duplicate(pulp_type: str, duplicate: DuplicateEntry):
    keyset_value = duplicate.keyset_value
    duplicate_pks = duplicate.duplicate_pks
    _logger.info(f"Duplicates found: {pulp_type=}; {keyset_value=}; {duplicate_pks=}")
Contributor

Is there a particular reason to separate this into its own function if it's only used in one place?

Member Author

Just for the abstraction. At a glance, it's trivial to read the main validate function and understand what it does, then dig into the helper functions if you're interested in implementation details.
If the call cost is not relevant (I believe it isn't here), then it's just a matter of style. I can undo that if you prefer.

Contributor

It's just a matter of style, but if only one line is doing work here, I'd rather just inline it personally

I do not consider it a blocker if you have strong feelings about it or if there are plans to re-use it in the future, though.

Member Author

Ok, I'll inline the log one. I'll keep count_duplicates and collect_duplicates, because it was at least convenient to test them in isolation.

def count_duplicates(content_qs, unique_keys: tuple[str, ...]) -> int:
    new_content_total = content_qs.count()
    unique_new_content_total = content_qs.distinct(*unique_keys).count()
    return new_content_total - unique_new_content_total
Contributor

same

Contributor

This case is at least more sensible than the other one though, since there's more than one line that's actually doing something

Added a proper error class for duplicate content handling and some more
logging to report exactly which content items conflict.

Closes: pulp#7184
@pedro-psb pedro-psb force-pushed the fix/7184-report-conflicting-packages branch from abc7a12 to c43e01d Compare February 6, 2026 17:25
@pedro-psb pedro-psb enabled auto-merge (rebase) February 6, 2026 17:45
keyset_value = duplicate.keyset_value
duplicate_pks = duplicate.duplicate_pks
_logger.info(f"Duplicates found: {pulp_type=}; {keyset_value=}; {duplicate_pks=}")
if dup_count > 0:
Member

I think the logic is messed up a bit:
if you have two types in the loop and the first one has duplicates but the second is fine, this will not raise.
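The concern can be demonstrated with a small sketch. A possible fix is to accumulate counts across all types and raise once after the loop, rather than letting the last iteration decide; the error class name and loop shape below are assumptions for illustration, not the actual patch:

```python
class DuplicateContentError(Exception):
    """Raised when a repository version would contain duplicate content."""


def validate_no_duplicates(duplicates_by_type: dict[str, int]) -> None:
    # If we raised based only on the current iteration's count, a clean
    # last type would mask duplicates found in earlier types. Accumulating
    # across the whole loop and raising afterwards avoids that.
    total = 0
    for pulp_type, dup_count in duplicates_by_type.items():
        if dup_count > 0:
            total += dup_count
    if total > 0:
        raise DuplicateContentError(f"{total} duplicate content units found")


# First type has duplicates, second is clean: this must still raise.
try:
    validate_no_duplicates({"rpm.package": 2, "rpm.advisory": 0})
except DuplicateContentError as exc:
    print(exc)  # 2 duplicate content units found
```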


Successfully merging this pull request may close these issues.

[PULP-1118] Sync failures due to NEVRA duplicates should also report the conflicting packages
